library(tidyverse)
library(RMySQL)
library(lubridate)

Front matter June 3, 2020 at 4:59PM.

Name your files applied_ps_3.Rmd and applied_ps_3.html. (5 pts.)

Follow the style guide (10 pts.)

This submission is our work alone and complies with the 30535 integrity policy.

Add your initials to indicate your agreement: JCI

Add names of anyone you discussed this problem set with: **__**

Submit by pushing your code to your repo on Github Classroom: https://classroom.github.com/g/5vDiXToZ.

Late coins used this pset: X. Late coins left: X.

waze data

Prelim questions

  1. Have you deleted any Waze data that you downloaded onto your computer (answer this at the end of the problem set in the affirmative)?

1 Waze data start-up (5 points)

Working with data on a server adds a challenge as you have to make calls to the database which take time to process. A call to the database can be slow for several reasons.

  1. the data you are trying to pull is very large.
  2. many people are making requests at the same time.
  3. something else is going on.

We can adjust for 1 and 2 by testing our code on small subsections of the data.

Next week we will provide an opt out where you can use csv we provide. Using this option will result in a 10 percent discount on your problem set final grade. For example, if you earn \(90\) pts based on your solutions, your final grade will be \(90 \cdot .9 = 81\).

  1. Which of the following methods will cause problems as you develop your solutions?
    1. Use filter() to reduce the amount of data you pull while exploring data. For example, you can filter by time and location to only get data for a small part of the city and/or over a short time period.
    2. collect() a small sample data set so that the you have data in memory on your computer.
    3. collect() the entire data set each time you want to work with it.
  1. collect() will cause problems, because each time we call it, the server needs to wake up and send us the entire data set, which is quite large. This will take a lot of time and bandwidth. Instead, we should use functions like filter and smaller collect() pulls to use the data more efficiently.
  1. As is the case with any data set, Waze has to make decisions about what data to store and how to measure it. Review the data documentation and the rest of the problem set. Propose a variable that Waze could feasibly track that is not available now or feasible and better way a to measure a variable currently in the dataset. Support your proposal with reasoning.

I would like to see Waze predict when draw bridges will raise and lower to allow ships to pass under. This is a major problem for people who have to commute over those bridges because it’s so unpredictable. It can easily add 20+ minutes to a commute. As is, Waze doesn’t know that bridges will raise until traffic has stopped and Wazers report it. But if Waze could cooperate with the authorities that control bridges (including cities, states, and the army corps of engineers) they could warn drivers in advance of these delays, allowing users to seek alternate routes. The challenge for adding this new variable would be coordinating with these authorities to generate and share the data; there are likely to be many agencies involved in maintaining these bridges, and some of them may not have the infrastructure that would be needed to report this data. But another approach might be to use live marine traffic data like https://www.marinetraffic.com/ to “guess” when a bridge might need to be raised to allow a ship under.

  1. As is the case with most consumer data, Waze users are self-selecting. Write a few detailed sentences about how you think self-selection influences what data is present. Waze users are likely self selecting in that they are tech-savvy people, which makes me think they are probably younger, more affluent, and more urban than the general population. This could cause them to overreport traffic in cities, while data in more rural areas is more sparse.

2 Waze vision zero (15 points)

Read up on the ggmap package, which will be useful for doing these problems. Particularly, get to know the get_stamenmap() function. If you find yourself downloading 1000s of tiles, check your settings. You are welcome to try using google basemaps as well; while free for new users, this will require a credit card. The version of ggmap on CRAN is out of date, instead find and install it from github.

  1. Look at Vision Zero Neighborhood High Crash Corridor #7. Plot the accidents in this corridor on a map.
library(ggmap)
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
library(RMySQL)
library(stringr)
connection <- DBI::dbConnect(RMySQL::MySQL(),
                      user = "ppha30531",
                      dbname="Waze2",
                      port = 3306,
                      password = "bUYjwnKXf49M2pb",
                      host = "uchicagowazereplica2.cfykgneqoh8w.us-west-2.rds.amazonaws.com"
)

# Show which tables are available from the Waze database
DBI::dbListTables(connection)
## [1] "chiAlerts"
# Create table referencing the data in the chiAlerts table
# of the waze database
chi_alerts_sql <- tbl(connection, "chiAlerts")
## Warning in .local(conn, statement, ...): Decimal MySQL column 12 imported
## as numeric
## Warning in .local(conn, statement, ...): Decimal MySQL column 13 imported
## as numeric
#Y bounds: 41.935, 41.896 (W George St to W Chicago Ave
event_data <- chi_alerts_sql %>%
  filter(str_detect(street, "N Western Ave"),
        41.896 <= location_y & location_y <= 41.934, 
        type == "ACCIDENT") %>%
  collect()
## Warning in .local(conn, statement, ...): Decimal MySQL column 12 imported
## as numeric

## Warning in .local(conn, statement, ...): Decimal MySQL column 13 imported
## as numeric
# Save collected data as an rds file, access and read it
saveRDS(event_data, file = "event_data.rds")
event_data <- readRDS(file = "event_data.rds")

# View active connections
dbListConnections(MySQL())
## [[1]]
## <MySQLConnection:0,0>
# End active connections
lapply( dbListConnections(MySQL()), function(x) dbDisconnect(x) )
## [[1]]
## [1] TRUE
event_data%>%
  group_by(street)%>%
  summarize(n())
## # A tibble: 1 x 2
##   street        `n()`
##   <chr>         <int>
## 1 N Western Ave   315
corridor_7 <- c(left = -87.7, bottom = 41.899 , right =
  -87.67, top = 41.934)

corridor_7_stamenmap <- get_stamenmap(data = event_data, 
              bbox = corridor_7, 
              zoom = 15, 
              maptype = "toner-lite")
## Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL.
corridor_7_map <- ggmap(corridor_7_stamenmap,
                        base_layer = ggplot(data = event_data)
                        )

corridor_7_map

corridor_7_map + 
  geom_point(aes(location_x, location_y), color = "red")
## Warning: Removed 12 rows containing missing values (geom_point).

  1. Around what intersection are accidents most common? Use Google Street View to look at this intersection. Do you see any problems? The intersection where accidents are most common is at the instersection of N Western Ave, We Logan Ave, Kennedy Western Rd and Kennedy Fullteron Rd. This intersection is extremely hard to navigate because the streets don’t intersect in a predictable way. It is difficult for drivers to know which lanes go to which road, and also to know which lanes are available to other cars. Making matters worse, the Windy City Field House parking lot has an outlet that feeds into the intersection, creating yet another unexpected point of entry to the intersection.

3 Transit Oriented Development (15 points)

  1. On October 21, the City of Chicago declared (https://www.cityofchicago.org/city/en/depts/mayor/press_room/press_releases/2018/october/5Million_TransitOriented_Development_HighRidership_Bus_Corridors.html) the 79 and 66 bus routes as areas of focus for transit oriented development. The City says the plan addresses bus “slow zones”. Note: Watch out for “179th St”.
  1. For each corridor, plot traffic alerts by time of day.
connection <- DBI::dbConnect(RMySQL::MySQL(),
                      user = "ppha30531",
                      dbname="Waze2",
                      port = 3306,
                      password = "bUYjwnKXf49M2pb",
                      host = "uchicagowazereplica2.cfykgneqoh8w.us-west-2.rds.amazonaws.com"
)

chi_alerts_sql <- tbl(connection, "chiAlerts")
## Warning in .local(conn, statement, ...): Decimal MySQL column 12 imported
## as numeric
## Warning in .local(conn, statement, ...): Decimal MySQL column 13 imported
## as numeric
event_data <- chi_alerts_sql %>%
  filter(street %in% c("E 79th St", "W 79th St", "Chicago Ave","E Chicago Ave", "W Chicago Ave")) %>%
  collect()
## Warning in .local(conn, statement, ...): Decimal MySQL column 12 imported
## as numeric

## Warning in .local(conn, statement, ...): Decimal MySQL column 13 imported
## as numeric
# Save collected data as an rds file, access and read it
saveRDS(event_data, file = "event_data.rds")
event_data <- readRDS(file = "event_data.rds")

# View active connections
dbListConnections(MySQL())
## [[1]]
## <MySQLConnection:0,1>
# End active connections
lapply( dbListConnections(MySQL()), function(x) dbDisconnect(x) )
## [[1]]
## [1] TRUE
event_data%>%
  group_by(street)%>%
  summarize(n())
## # A tibble: 5 x 2
##   street        `n()`
##   <chr>         <int>
## 1 Chicago Ave    3222
## 2 E 79th St     37827
## 3 E Chicago Ave  2293
## 4 W 79th St     13842
## 5 W Chicago Ave 29932
event_data <- event_data%>%
  mutate("corridor" = ifelse(street %in% c("E 79th St", "W 79th St"), "79th St", "Chicago Ave"),
         event_date_time = as.POSIXct (pubMillis/1000, origin="1970-01-01")
         )

ggplot(event_data, 
      aes(hour(event_date_time), 
      fill = corridor)) + 
  geom_bar(position = "dodge") 

  1. Using a reasoned approach, choose two additional corridors for comparison.
  2. What corridors did you choose and why? I chose Routes 87 and 70 because they run parallel to the 79 and 66 a few blocks away and have a similar east-west range
  1. Make comparison plots.
connection <- DBI::dbConnect(RMySQL::MySQL(),
                      user = "ppha30531",
                      dbname="Waze2",
                      port = 3306,
                      password = "bUYjwnKXf49M2pb",
                      host = "uchicagowazereplica2.cfykgneqoh8w.us-west-2.rds.amazonaws.com"
)

chi_alerts_sql <- tbl(connection, "chiAlerts")
## Warning in .local(conn, statement, ...): Decimal MySQL column 12 imported
## as numeric
## Warning in .local(conn, statement, ...): Decimal MySQL column 13 imported
## as numeric
event_data_comparison <- chi_alerts_sql %>%
  filter(street %in% c("E 87th St", "W 87th St", "E Division St", "W Division St", "Division St")) %>%
  collect()
## Warning in .local(conn, statement, ...): Decimal MySQL column 12 imported
## as numeric

## Warning in .local(conn, statement, ...): Decimal MySQL column 13 imported
## as numeric
# Save collected data as an rds file, access and read it
saveRDS(event_data_comparison, file = "event_data.rds")
event_data_comparison <- readRDS(file = "event_data.rds")

# View active connections
dbListConnections(MySQL())
## [[1]]
## <MySQLConnection:0,2>
# End active connections
lapply( dbListConnections(MySQL()), function(x) dbDisconnect(x) )
## [[1]]
## [1] TRUE
event_data_comparison%>%
  group_by(street)%>%
  summarize(n())
## # A tibble: 5 x 2
##   street        `n()`
##   <chr>         <int>
## 1 Division St     726
## 2 E 87th St      1043
## 3 E Division St    78
## 4 W 87th St     10818
## 5 W Division St 21531
event_data_comparison <- event_data_comparison%>%
  mutate("corridor" = ifelse(street %in% c("E 87th St", "W 87th St"), "87th St", "Division St"),
         event_date_time = as.POSIXct (pubMillis/1000, origin="1970-01-01")
         )

ggplot(event_data_comparison, 
      aes(hour(event_date_time), 
      fill = corridor)) + 
  geom_bar(position = "dodge") 

a. Looking beyond traffic, what other alerts are very common in this area? Do you think these alerts would slow down the 66 / 79? If so, what steps could the City take to address the issues?

ggplot(event_data, 
      aes(hour(event_date_time), 
      fill = type)) + 
  geom_bar(position = "dodge") 

Besides traffic jams, weather hazards and accidents are also common. These would definitely slow down traffic for the buses. There also seems to be a case where 79th street was closed, triggering many event reports.

4 Waze single event (20 point)

  1. Revisit the event which caused c5a73cc6-5242-3172-be5a-cf8990d70cb2.
connection <- DBI::dbConnect(RMySQL::MySQL(),
                      user = "ppha30531",
                      dbname="Waze2",
                      port = 3306,
                      password = "bUYjwnKXf49M2pb",
                      host = "uchicagowazereplica2.cfykgneqoh8w.us-west-2.rds.amazonaws.com"
)

chi_alerts_sql <- tbl(connection, "chiAlerts")
## Warning in .local(conn, statement, ...): Decimal MySQL column 12 imported
## as numeric
## Warning in .local(conn, statement, ...): Decimal MySQL column 13 imported
## as numeric
single_event_data <- chi_alerts_sql %>%
  filter(uuid == "c5a73cc6-5242-3172-be5a-cf8990d70cb2") %>%
  collect()
## Warning in .local(conn, statement, ...): Decimal MySQL column 12 imported
## as numeric

## Warning in .local(conn, statement, ...): Decimal MySQL column 13 imported
## as numeric
# Save collected data as an rds file, access and read it
saveRDS(event_data_comparison, file = "single_event_data.rds")
single_event_data <- readRDS(file = "single_event_data.rds")

# View active connections
dbListConnections(MySQL())
## [[1]]
## <MySQLConnection:0,3>
# End active connections
lapply( dbListConnections(MySQL()), function(x) dbDisconnect(x) )
## [[1]]
## [1] TRUE
event_data_comparison
## # A tibble: 34,196 x 19
##    country nTHumbsUp city  reportRating confidence reliability type  uuid 
##    <chr>       <int> <chr>        <int>      <int>       <int> <chr> <chr>
##  1 US              0 Chic…            4          0           6 WEAT… 6bfc…
##  2 US              0 Chic…            4          0           6 WEAT… 6bfc…
##  3 US              0 Chic…            4          0           6 WEAT… 6bfc…
##  4 US              0 Chic…            4          0           6 WEAT… 6bfc…
##  5 US              0 Chic…            3          0           5 WEAT… bdfc…
##  6 US              0 Chic…            4          0           6 WEAT… 6bfc…
##  7 US              0 Chic…            3          0           5 WEAT… bdfc…
##  8 US              0 Chic…            4          0           6 WEAT… 6bfc…
##  9 US              0 Chic…            3          0           5 WEAT… bdfc…
## 10 US              0 Chic…            1          0           5 WEAT… 870c…
## # … with 34,186 more rows, and 11 more variables: roadType <int>,
## #   magvar <int>, subtype <chr>, street <chr>, location_x <dbl>,
## #   location_y <dbl>, pubMillis <dbl>, reportDescription <chr>,
## #   scrape_dt <chr>, corridor <chr>, event_date_time <dttm>
  1. Define a bounding box around the cause of the event.
connection <- DBI::dbConnect(RMySQL::MySQL(),
                      user = "ppha30531",
                      dbname="Waze2",
                      port = 3306,
                      password = "bUYjwnKXf49M2pb",
                      host = "uchicagowazereplica2.cfykgneqoh8w.us-west-2.rds.amazonaws.com"
)

chi_alerts_sql <- tbl(connection, "chiAlerts")
## Warning in .local(conn, statement, ...): Decimal MySQL column 12 imported
## as numeric
## Warning in .local(conn, statement, ...): Decimal MySQL column 13 imported
## as numeric
single_event_data <- chi_alerts_sql %>%
  filter(-87.624138 <= location_x, location_x <= -87.612916,
          41.857933 <= location_y, location_y <= 41.867) %>%
  collect()
## Warning in .local(conn, statement, ...): Decimal MySQL column 12 imported
## as numeric

## Warning in .local(conn, statement, ...): Decimal MySQL column 13 imported
## as numeric
# Save collected data as an rds file, access and read it
saveRDS(event_data_comparison, file = "single_event_data.rds")
single_event_data <- readRDS(file = "single_event_data.rds")

# View active connections
dbListConnections(MySQL())
## [[1]]
## <MySQLConnection:0,4>
# End active connections
lapply( dbListConnections(MySQL()), function(x) dbDisconnect(x) )
## [[1]]
## [1] TRUE
event_data_comparison
## # A tibble: 34,196 x 19
##    country nTHumbsUp city  reportRating confidence reliability type  uuid 
##    <chr>       <int> <chr>        <int>      <int>       <int> <chr> <chr>
##  1 US              0 Chic…            4          0           6 WEAT… 6bfc…
##  2 US              0 Chic…            4          0           6 WEAT… 6bfc…
##  3 US              0 Chic…            4          0           6 WEAT… 6bfc…
##  4 US              0 Chic…            4          0           6 WEAT… 6bfc…
##  5 US              0 Chic…            3          0           5 WEAT… bdfc…
##  6 US              0 Chic…            4          0           6 WEAT… 6bfc…
##  7 US              0 Chic…            3          0           5 WEAT… bdfc…
##  8 US              0 Chic…            4          0           6 WEAT… 6bfc…
##  9 US              0 Chic…            3          0           5 WEAT… bdfc…
## 10 US              0 Chic…            1          0           5 WEAT… 870c…
## # … with 34,186 more rows, and 11 more variables: roadType <int>,
## #   magvar <int>, subtype <chr>, street <chr>, location_x <dbl>,
## #   location_y <dbl>, pubMillis <dbl>, reportDescription <chr>,
## #   scrape_dt <chr>, corridor <chr>, event_date_time <dttm>
single_event_bounds <- c(left = -87.624138, bottom = 41.857933 , right =
  -87.612916, top = 41.867)

single_event_stamenmap <- get_stamenmap(data = event_data, 
              bbox = single_event_bounds, 
              zoom = 16, 
              maptype = "toner-lite")
## Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL.
single_event_map <- ggmap(single_event_stamenmap,
                        base_layer = ggplot(data = single_event_data)
                        )

single_event_map

single_event_map + 
  geom_point(aes(location_x, location_y), color = "red")
## Warning: Removed 34196 rows containing missing values (geom_point).

  1. What causes all these jams? Some googling might help.
single_event_data <- single_event_data%>%
  mutate(event_date_time = as.POSIXct (pubMillis/1000, origin="1970-01-01"))
  1. Plot the number of jams 6AM-6PM CST. Why are there two humps?

  2. Place one vertical line at each hump.

  3. Next, propose a quantitative measure of traffic jam severity that combines the number of traffic JAM alerts with information in the subtype variable.

  4. Plot this measure from 6AM-6PM CST. Is there any information that is conveyed by your severity measure that was not captured by plotting the number of jams? If so, what is it?

5 Waze aggregate over multiple events (30 points)

  1. Pick one major accident. What is the uuid? Sample alerts from the two hours before the accident first appeared in the data and two hours after the accident for a geographic box of 0.1 miles around the accident. Make a plot where the y-axis is the number of traffic jam alerts and the x-axis is the five-minute interval from two hours before the accident to two hours after the accident. Warning: This question is harder than it first appears. You might want to review R4DS chapter 12.5 (lecture note 5) on missing values and chapter 16.4 (lecture note 9).

  2. Building on your work for the prior question, write a function that takes as its arguments uuid, a date-time, a latitude and a longitude and returns a data frame with the number of alerts in each five-minute interval from two hours before to two hours after.

  3. Make a data frame with every major accident on Nov 20, 2017. Feed each row of this data frame to your function. Collapse the output into the mean number of traffic jam alerts in each five-minute interval in the two hours before the accident and two hours after the accident for a geographic box of 0.1 miles. Tip: This may take upwards of 20 minutes to run on all major accidents. Use your function on a small sample of accidents first to make sure your code is working as expected before trying to run on all accidents.

  4. Plot the mean number of jam alerts around major accident. To be clear, the correct answer here is a single plot that summarizes jams across major accidents, not one plot for each accident. Congratulations! This is your first event study.